add vector optimization for loongarch64 #4242

Merged: 22 commits, Nov 11, 2022

junchao-loongson
Contributor

#### lsx on

loop_count = 10
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet  min =   30.11  max =   30.25  avg =   30.17
     squeezenet_int8  min =   36.20  max =   36.77  avg =   36.39
           mobilenet  min =   54.16  max =   55.30  avg =   54.94
      mobilenet_int8  min =   73.63  max =   84.76  avg =   75.02
        mobilenet_v2  min =   29.86  max =   30.04  avg =   29.93
        mobilenet_v3  min =   29.87  max =   30.27  avg =   29.96
          shufflenet  min =   14.28  max =   14.43  avg =   14.33
       shufflenet_v2  min =   14.69  max =   15.15  avg =   14.78
             mnasnet  min =   33.19  max =   36.34  avg =   33.63
     proxylessnasnet  min =   41.12  max =   41.42  avg =   41.32
     efficientnet_b0  min =   56.44  max =   56.90  avg =   56.57
   efficientnetv2_b0  min =   57.82  max =   70.10  avg =   59.27
        regnety_400m  min =   42.30  max =   43.77  avg =   42.66
           blazeface  min =    3.90  max =    3.97  avg =    3.92
           googlenet  min =   96.57  max =   97.67  avg =   97.24
      googlenet_int8  min =  114.78  max =  125.97  avg =  116.31
            resnet18  min =   76.17  max =   81.27  avg =   77.84
       resnet18_int8  min =   90.56  max =  105.29  avg =   92.35
             alexnet  min =   64.05  max =   72.82  avg =   65.37
               vgg16  min =  485.09  max =  492.48  avg =  489.88
          vgg16_int8  min =  526.74  max =  551.45  avg =  533.22
            resnet50  min =  245.49  max =  278.99  avg =  257.55
       resnet50_int8  min =  295.99  max =  327.18  avg =  310.22
      squeezenet_ssd  min =   61.70  max =   62.46  avg =   62.09
 squeezenet_ssd_int8  min =   77.95  max =   89.45  avg =   79.49
       mobilenet_ssd  min =  111.67  max =  114.42  avg =  112.48
  mobilenet_ssd_int8  min =  146.82  max =  170.33  avg =  150.07
      mobilenet_yolo  min =  285.39  max =  299.10  avg =  288.71
  mobilenetv2_yolov3  min =  119.12  max =  131.47  avg =  120.98
         yolov4-tiny  min =  153.83  max =  174.90  avg =  160.86
           nanodet_m  min =   36.52  max =   75.02  avg =   40.53
    yolo-fastest-1.1  min =   16.16  max =   19.05  avg =   16.49
      yolo-fastestv2  min =   14.61  max =   14.87  avg =   14.76
  vision_transformer  min = 1652.83  max = 1672.56  avg = 1659.27
          FastestDet  min =   18.22  max =   21.26  avg =   18.58



#### lsx off

loop_count = 10
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet  min =   30.01  max =   30.56  avg =   30.11
     squeezenet_int8  min =   42.79  max =   57.73  avg =   45.00
           mobilenet  min =   55.12  max =   55.93  avg =   55.66
      mobilenet_int8  min =   89.75  max =   92.61  avg =   90.19
        mobilenet_v2  min =   30.62  max =   33.51  avg =   31.00
        mobilenet_v3  min =   31.04  max =   31.30  avg =   31.18
          shufflenet  min =   14.54  max =   14.72  avg =   14.61
       shufflenet_v2  min =   14.88  max =   15.47  avg =   15.01
             mnasnet  min =   33.66  max =   33.91  avg =   33.78
     proxylessnasnet  min =   41.61  max =   51.03  avg =   42.71
     efficientnet_b0  min =   58.00  max =   69.22  avg =   59.28
   efficientnetv2_b0  min =   59.64  max =   59.99  avg =   59.81
        regnety_400m  min =   42.73  max =   43.06  avg =   42.92
           blazeface  min =    3.93  max =    4.01  avg =    3.95
           googlenet  min =   97.71  max =  101.34  avg =   98.36
      googlenet_int8  min =  138.93  max =  182.00  avg =  145.42
            resnet18  min =   81.82  max =   84.90  avg =   82.62
       resnet18_int8  min =  108.41  max =  118.75  avg =  109.66
             alexnet  min =   67.48  max =   71.12  avg =   68.89
               vgg16  min =  480.30  max =  503.93  avg =  488.02
          vgg16_int8  min =  609.66  max =  628.12  avg =  620.36
            resnet50  min =  245.73  max =  250.32  avg =  247.84
       resnet50_int8  min =  357.05  max =  389.35  avg =  362.26
      squeezenet_ssd  min =   62.36  max =   73.35  avg =   63.66
 squeezenet_ssd_int8  min =   86.24  max =   89.02  avg =   87.41
       mobilenet_ssd  min =  111.42  max =  119.37  avg =  112.79
  mobilenet_ssd_int8  min =  178.72  max =  188.27  avg =  180.56
      mobilenet_yolo  min =  297.02  max =  305.32  avg =  299.69
  mobilenetv2_yolov3  min =  122.81  max =  124.30  avg =  123.32
         yolov4-tiny  min =  150.51  max =  156.30  avg =  152.37
           nanodet_m  min =   36.21  max =   36.57  avg =   36.34
    yolo-fastest-1.1  min =   15.78  max =   27.00  avg =   16.96
      yolo-fastestv2  min =   14.69  max =   14.95  avg =   14.81
  vision_transformer  min = 6423.16  max = 6451.22  avg = 6435.01
          FastestDet  min =   18.16  max =   18.46  avg =   18.32
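
To make the two tables above easier to compare, the sketch below computes the ratio of the reported `avg` times (lsx off divided by lsx on) for a few models. The numbers are copied from the tables; the struct, array, and function names are illustrative only and are not part of ncnn.

```cpp
#include <cstdio>

// Average latency in ms, copied from the "lsx on" and "lsx off" tables above.
struct BenchRow
{
    const char* name;
    double avg_lsx_on;
    double avg_lsx_off;
};

// A value above 1.0 means the LSX build is faster for that model.
inline double speedup(const BenchRow& r)
{
    return r.avg_lsx_off / r.avg_lsx_on;
}

static const BenchRow kRows[] = {
    {"squeezenet", 30.17, 30.11},
    {"mobilenet_int8", 75.02, 90.19},
    {"vgg16_int8", 533.22, 620.36},
    {"vision_transformer", 1659.27, 6435.01},
};

// Print one line per model with its speedup factor.
inline void print_speedups()
{
    for (const BenchRow& r : kRows)
        std::printf("%20s  speedup = %.2fx\n", r.name, speedup(r));
}
```

Calling `print_speedups()` from a `main` function lists the ratios. On these numbers the fp32 CNN models are roughly unchanged, the int8 models gain somewhat, and `vision_transformer` gains the most.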

Built with -DNCNN_BUILD_TESTS=ON and ran the tests; no errors were found.

@tencent-adm
Member

tencent-adm commented Oct 8, 2022

CLA assistant check
All committers have signed the CLA.

@tpoisonooo
Contributor

Impressive work, but no CI has been added.

@nihui
Member

nihui commented Oct 8, 2022

Here are the numbers from a 3A4000. It looks like the 3A5000 is actually slower.

https://github.com/Tencent/ncnn/blob/master/benchmark/README.md#loongson-3a4000-gs464v-18ghz--4-with-msa128

root@3A4K:~/Desktop/ncnn-20220420/ncnn-20220420/build/benchmark$ ./benchncnn 
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =   18.31  max =   18.97  avg =   18.64
     squeezenet_int8  min =   22.11  max =   35.58  avg =   25.60
           mobilenet  min =   28.07  max =   29.68  avg =   28.64
      mobilenet_int8  min =   34.10  max =  110.13  avg =   57.77
        mobilenet_v2  min =   20.73  max =   21.48  avg =   21.09
        mobilenet_v3  min =   19.92  max =   20.11  avg =   20.02
          shufflenet  min =   13.25  max =   13.98  avg =   13.51
       shufflenet_v2  min =   12.67  max =   12.95  avg =   12.87
             mnasnet  min =   20.04  max =   20.63  avg =   20.37
     proxylessnasnet  min =   23.90  max =   24.62  avg =   24.25
     efficientnet_b0  min =   38.09  max =   56.57  avg =   43.08
   efficientnetv2_b0  min =   41.14  max =   41.82  avg =   41.36
        regnety_400m  min =   36.19  max =   37.52  avg =   36.79
           blazeface  min =    4.05  max =    4.51  avg =    4.24
           googlenet  min =   74.61  max =   87.59  avg =   78.16
      googlenet_int8  min =   85.53  max =   87.06  avg =   86.27
            resnet18  min =   64.90  max =   71.13  avg =   67.04
       resnet18_int8  min =   60.56  max =   72.30  avg =   63.62
             alexnet  min =   74.92  max =   80.70  avg =   76.49
               vgg16  min =  335.14  max =  349.20  avg =  340.92
          vgg16_int8  min =  299.33  max =  371.58  avg =  318.36
            resnet50  min =  148.97  max =  240.90  avg =  176.92
       resnet50_int8  min =  161.41  max =  256.27  avg =  186.67
      squeezenet_ssd  min =   59.74  max =   60.25  avg =   59.92
 squeezenet_ssd_int8  min =   59.38  max =  140.09  avg =   79.84
       mobilenet_ssd  min =   59.61  max =   61.20  avg =   60.63
  mobilenet_ssd_int8  min =   71.35  max =  171.46  avg =  108.97
      mobilenet_yolo  min =  176.17  max =  262.16  avg =  201.31
  mobilenetv2_yolov3  min =   79.15  max =   87.97  avg =   81.50
         yolov4-tiny  min =  113.99  max =  117.35  avg =  115.06
           nanodet_m  min =   26.27  max =   27.11  avg =   26.63
    yolo-fastest-1.1  min =   11.65  max =  117.04  avg =   38.22
      yolo-fastestv2  min =   12.03  max =   12.40  avg =   12.16

@junchao-loongson
Contributor Author

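Things went suspiciously smoothly earlier because I had accidentally replaced the macro definitions. Now that I have reverted that, quite a few problems have shown up. Give me some time to fix them.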

@junchao-loongson
Contributor Author

> # cat do_test.sh                                                                             [±master]

for ncnn_test in `ls test_*`
do
	 echo "----- "$ncnn_test
	 if test $ncnn_test != "test_reduction"
	 then
		./$ncnn_test
	 fi
done
> # bash do_test.sh                                                                                                [±master]
----- test_absval
----- test_batchnorm
----- test_bias
----- test_binaryop
----- test_bnll
----- test_c_api
----- test_cast
----- test_clip
----- test_concat
----- test_convolution
value not match  at c:4 d:0 h:0 w:0    expect 0.754376 but got 1.000000
test_layer_cpu failed
test_layer Convolution failed use_packing_layout=0 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=1 use_winograd_convolution=1
test_convolution_int8 failed w=9 h=7 c=7 outch=7 kernel=1 dilation=1 stride=1 pad=0 bias=1 requant=0 act=4 actparams=[-0.136048,0.266064]
----- test_convolution1d
----- test_convolution3d
----- test_convolutiondepthwise
value not match  at c:0 d:0 h:0 w:0    expect 0.930540 but got 0.500000
test_layer_cpu failed
test_layer ConvolutionDepthWise failed use_packing_layout=0 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=1 use_winograd_convolution=1
test_convolutiondepthwise_int8 failed w=15 h=7 c=8 outch=8 kernel=3 dilation=1 stride=1 pad=1 bias=0 group=2 requant=0 act=4 actparams=[-0.370383,0.139109]
----- test_convolutiondepthwise1d
----- test_convolutiondepthwise3d
----- test_cpu
----- test_crop
----- test_deconvolution
----- test_deconvolution1d
----- test_deconvolution3d
----- test_deconvolutiondepthwise
----- test_deconvolutiondepthwise1d
----- test_deconvolutiondepthwise3d
----- test_deepcopy
----- test_deformableconv2d
----- test_dequantize
----- test_dropout
----- test_einsum
----- test_eltwise
----- test_elu
----- test_expanddims
----- test_flatten
----- test_gelu
----- test_gemm
----- test_groupnorm
----- test_gru
----- test_hardsigmoid
----- test_hardswish
----- test_innerproduct
----- test_instancenorm
----- test_interp
----- test_layernorm
----- test_lrn
----- test_lstm
----- test_matmul
----- test_mat_pixel
----- test_mat_pixel_affine
----- test_mat_pixel_drawing
----- test_mat_pixel_resize
----- test_mat_pixel_rotate
----- test_memorydata
----- test_mish
----- test_multiheadattention
----- test_noop
----- test_normalize
----- test_packing
----- test_padding
----- test_permute
----- test_pixelshuffle
----- test_pooling
----- test_pooling1d
----- test_pooling3d
----- test_power
----- test_prelu
----- test_priorbox
----- test_quantize
----- test_reduction
----- test_relu
----- test_reorg
----- test_requantize
value not match  at c:1 d:0 h:0 w:0    expect 26.000000 but got -2.000000
test_layer_cpu failed
test_layer Requantize failed use_packing_layout=1 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=0 use_winograd_convolution=0
test_requantize failed a.dims=3 a=(7 9 12) scale_in_data_size=1 scale_out_data_size=1 bias_data_size=12 act=0 actparams=[0.000000,0.000000]
----- test_reshape
----- test_rnn
----- test_roialign
----- test_roipooling
----- test_scale
----- test_selu
----- test_shufflechannel
----- test_sigmoid
----- test_slice
----- test_softmax
----- test_softplus
----- test_squeeze
----- test_squeezenet
----- test_swish
----- test_tanh
----- test_tile
----- test_unaryop
----- test_yolov3detectionoutput

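Three tests still fail; could the maintainers take a look?

For madd, msub, and vbitsel, I temporarily modified lsxintrin.h to swap the argument order so that the results are correct. I will update the code under layer/loongarch later.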
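
The argument-order issue described above can be illustrated with scalar models. This is a sketch under assumptions: as far as I understand the two intrinsic sets, MSA's `__msa_fmadd_w(a, b, c)` computes `a + b * c` while LSX's `__lsx_vfmadd_s(a, b, c)` computes `a * b + c`, and the mask operand is first in `__msa_bsel_v` but last in `__lsx_vbitsel_v`. The helper names below are hypothetical scalar stand-ins, not real intrinsics, and the semantics should be checked against the toolchain's own headers.

```cpp
#include <cstdint>

// Scalar stand-in for an accumulator-first multiply-add (assumed MSA style).
inline float madd_acc_first(float acc, float x, float y)
{
    return acc + x * y;
}

// Scalar stand-in for an accumulator-last multiply-add (assumed LSX style).
inline float madd_acc_last(float x, float y, float acc)
{
    return x * y + acc;
}

// Bit select with the mask as the first operand (assumed MSA style):
// a set mask bit picks the bit from if_set, a clear bit picks from if_clear.
inline uint32_t bitsel_mask_first(uint32_t mask, uint32_t if_clear, uint32_t if_set)
{
    return (if_clear & ~mask) | (if_set & mask);
}

// Same selection with the mask as the last operand (assumed LSX style).
inline uint32_t bitsel_mask_last(uint32_t if_clear, uint32_t if_set, uint32_t mask)
{
    return (if_clear & ~mask) | (if_set & mask);
}
```

Passing MSA-ordered arguments straight to the accumulator-last form changes the result: `madd_acc_first(1, 2, 3)` is 7, but `madd_acc_last(1, 2, 3)` is 5. A mechanical rename from one intrinsic set to the other therefore needs the operands reordered at each call site.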

@junchao-loongson
Contributor Author

junchao-loongson commented Oct 27, 2022

ncnn-2
ncnn-1

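Unit tests pass. The images above compare benchmark performance. Because I have limited hardware, I could not find a 3A5000 and a 3A4000 running at the same clock frequency for testing.

Benchmark command:
../build/benchmark/benchncnn 10 $(nproc) 0 0

3A5000 configuration:

CPU: 2.5 GHz
Loongson-3A5000-7A1000-1w-V0.1-CRB
Memory: 8 GB x 2

3A4000 configuration:

CPU: 1.8 GHz
Memory: 8 GB x 2, DDR4, Speed: 2133 MT/s

If anyone has a 3A4000 with a comparable configuration, please run the benchmark for comparison; my 3A4000 is underperforming.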

@Yoh-Z
Contributor

Yoh-Z commented Oct 27, 2022

Awesome!

{
__builtin_prefetch(ptr + 16);
__m128i _p = __lsx_vld(ptr, 0);
v4f32 _outp = (v4f32)__lsx_vbitclri_w(_p, 31);
Member

__lsx_vst accepts the __m128i type, so no v4f32 cast is needed.
This comment applies throughout the patch.

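I can only use __m128i where possible. Functions whose names start with __lsx_vf require v4f32 arguments, so __m128i still has to be cast to v4f32 for those.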
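
For context on the snippet under review: `__lsx_vbitclri_w(_p, 31)` clears bit 31 of each 32-bit lane. That bit is the IEEE 754 sign bit, so the result is the absolute value of each float. Below is a portable scalar sketch of the same bit trick. The function name is illustrative, and this is not the LSX code from the patch.

```cpp
#include <cstdint>
#include <cstring>

// Return |x| by clearing bit 31, the sign bit of an IEEE 754 binary32 value.
inline float abs_by_clearing_sign_bit(float x)
{
    static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits)); // reinterpret without undefined behavior
    bits &= ~(UINT32_C(1) << 31);         // clear the sign bit only
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

Because the operation is purely bitwise, it is natural to express it on the integer vector type; a float vector type only becomes necessary once a floating-point intrinsic consumes the value.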

@codecov-commenter

codecov-commenter commented Oct 29, 2022

Codecov Report

Merging #4242 (a29f701) into master (e7eadca) will decrease coverage by 2.74%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #4242      +/-   ##
==========================================
- Coverage   94.44%   91.70%   -2.75%     
==========================================
  Files         750      783      +33     
  Lines      179375   184371    +4996     
==========================================
- Hits       169417   169071     -346     
- Misses       9958    15300    +5342     
Impacted Files Coverage Δ
src/layer.cpp 46.00% <ø> (ø)
src/mat.h 89.87% <ø> (+0.05%) ⬆️
src/cpu.cpp 58.43% <66.66%> (-3.69%) ⬇️
src/layer/x86/deformableconv2d_x86.cpp 0.00% <0.00%> (-97.92%) ⬇️
src/layer/deformableconv2d.cpp 0.00% <0.00%> (-97.20%) ⬇️
src/layer/arm/innerproduct_arm_asimdhp.cpp 94.72% <0.00%> (-3.19%) ⬇️
src/layer/x86/innerproduct_x86.cpp 97.12% <0.00%> (-2.23%) ⬇️
src/allocator.cpp 75.79% <0.00%> (-1.26%) ⬇️
src/layer/x86/convolution_winograd_dot_pack16.h 96.89% <0.00%> (-0.86%) ⬇️
... and 568 more

@nihui nihui merged commit 279222c into Tencent:master Nov 11, 2022
@nihui
Member

nihui commented Nov 11, 2022

Thanks for your contribution !

csukuangfj added a commit to csukuangfj/ncnn that referenced this pull request Dec 1, 2022
* remove duplicated newline (Tencent#4187)

* remove duplicated newline (Tencent#4188)

* optmize softmax arm neon (Tencent#4171)

* [docs] Fix typo (Tencent#4201)

* [Prelu x86] Finish intrinsic with elempack merged (Tencent#4177)

* changed size of images for pretty formatting of page (Tencent#4193)

* [Gelu x86] Finish intrinsic with elempack merged(fast version) (Tencent#4144)

* Finish the gelu x86 intrinsics
* Finish the fast tanh x86 simd impl

* Ignore .xmake directory (Tencent#4212)

* Bump pypa/cibuildwheel from 2.9.0 to 2.10.1 (Tencent#4207)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.9.0 to 2.10.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.9.0...v2.10.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* style: space alignment (Tencent#4217)

* Ignore CMakeSettings.json, the Visual Studio CMake schema file (Tencent#4228)

* RVV: use new interface for segment load/store & change word_type to size_t&add clang ci (part Tencent#4100) (Tencent#4118)

* RVV: use size_t for vl

* RVV: replace vsseg.v tuple type by using regex

-----

search:
vsseg([1-9])e(8|16|32)_v_(f|i|u)\2m(1|2|4|8)x\1\(([ -~]+), vcreate_\3\2m\4x\1\(([ -~]+)\), vl\);

substitute by:
vsseg$1e$2_v_$3$2m$4($5, $6, vl);

* RVV: replace vssseg.v tuple types by using regex

---

search:
vssseg([1-9])e(8|16|32)_v_f\2m1x\1\(([ -~]+), vcreate_f\2m1x\1\(([ -~]+)\), vl\);

substitute by:
vssseg$1e$2_v_f$2m1($3, $4, vl);

* RVV: replace vlseg.v tuple types in load/store

* RVV: replace vloxseg2ei32.v tuple types

* RVV: add a wrapper for old compilers

* RVV: add segment load/store wrapper in packing

* RVV: fix cmake test

* RVV: make clang happy by dropping VLAs in sgemm

* RVV: add clang cmake toolchain configure

* RVV: add clang ci, riscv64-unknown-linux-gnu

Co-authored-by: thelastlin <thelastlin@users.noreply.github.com>
Co-authored-by: nihui <shuizhuyuanluo@126.com>

* Bump pypa/cibuildwheel from 2.10.1 to 2.10.2 (Tencent#4220)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.10.1 to 2.10.2.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.10.1...v2.10.2)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* add c906 build ci (Tencent#4232)

* Add benchmark result of T-Head TH1520 (Tencent#4240)

`cpuinfo`: 

```
isa             : rv64imafdcvsu
mmu             : sv39
cpu-freq                : 1.848Ghz
cpu-icache              : 64KB
cpu-dcache              : 64KB
cpu-l2cache             : 1MB
cpu-tlb         : 1024 4-ways
cpu-cacheline           : 64Bytes
cpu-vector              : 0.7.1
```

Compiled with `-DCMAKE_TOOLCHAIN_FILE=../toolchains/c910-v240.toolchain.cmake -DCMAKE_BUILD_TYPE=release -DNCNN_OPENMP=OFF -DNCNN_THREADS=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_RVV=ON -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON` 

Seems much worse than expected 🤔

* fix param parsing issue when layer/blob name exceeds 255 (Tencent#4236)

* fix param parsing issue when layer/blob name exceeds 255

* apply code-format changes

Co-authored-by: ZhangGe6 <ZhangGe6@users.noreply.github.com>

* Memory Pool Improvement For Variadic Sized Inputs (Tencent#4190)

* Simple miss count for better space efficiency

* Simple double ended greedy;

* Add size drop threshold setter;

* set workspace allocator cr to zero as we had some sort of recycling capability :P

Co-authored-by: LinHeLurking <LinHeLurking@users.noreply.github.com>
Co-authored-by: nihuini <nihuini@tencent.com>

* docs: disable fp16 when wrong results encountered caused by overflow (Tencent#4248)

* pnnx math operation (Tencent#4251)

* more stricter armv7 fp16 and armv84 bf16 compiler check, fix Tencent#4147 fix Tencent#4222 (Tencent#4247)

* modified the param axes of expanddims in modelwriter (Tencent#4259)

* Add TH1520 (4*C910V) toolchain support.  (Tencent#4267)

* implement lstm proj_size (Tencent#4263)

* Optimize x86 DeformableConv2D (Tencent#4128)

* fix compile warning with gcc 9.1.0 including simplestl.h file (Tencent#4274)

* fix compile warning with gcc 9.1.0 including simplestl.h file

* apply code-format changes

Co-authored-by: veahow <veahow@users.noreply.github.com>

* add benchmark for rk3588 on rock5b (Tencent#4275)

* linux-x64-cpu-gcc on tencent ci

* implement layer feature disabled bit (Tencent#4278)

* add elu vulkan operator (Tencent#4280)

* fix tencent ci (Tencent#4277)

* implement GLU and pnnx conversion (Tencent#4283)

* Bump pypa/cibuildwheel from 2.10.2 to 2.11.1 (Tencent#4271)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.10.2 to 2.11.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.10.2...v2.11.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix pnnx softmax/normalize/slice negative axis conversion to ncnn (Tencent#4284)

* pnnx glu batchindex aware conversion (Tencent#4285)

* 1. Fix typo in readme (Tencent#4287)

* x86 sse2/avx2 optimization for convolution sgemm/winograd int8 family (Tencent#4286)

* pnnx skip dynamic size evaluation (Tencent#4291)

* Fix linux build error(Tencent#4265) (Tencent#4294)

Co-authored-by: wangyu <786794414@qq.com>

* general cpu feature detection on macos/ios, enable bf16 and i8mm on a15 a16 and m2 (Tencent#4300)

* x86 unified fc fp32/fp16s (Tencent#4303)

* more fma
* more transpose utility function

* Bump pypa/cibuildwheel from 2.11.1 to 2.11.2 (Tencent#4308)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.11.1 to 2.11.2.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.11.1...v2.11.2)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* pnnx pytorch 1.13 (Tencent#4314)

* fix Tencent#4315 (Tencent#4316)

* get_physical_cpu_count api family (Tencent#4302)

* get_physical_cpu_count api family

* set default to physical big cpu

* always treat smt core as big core

* is_smt_cpu

* get max freq mhz on windows

* windows thread affinity

* groupnorm 1d/2d/4d (Tencent#4312)

* fix slice end index, fix fp16 model weight alignment (Tencent#4317)

* tencent ci test-coverage pnnx (Tencent#4305)

* RVV: BatchNorm with fp16s(a) support (Tencent#4075)

* RVV: InstanceNorm with fp16s(a) support (Tencent#4078)

* fix ci pnnx build

* fold new_full and full_like (Tencent#4323)

* pnnx convert nn.Softmax2d (Tencent#4324)

* pnnx convert fold unfold (Tencent#4325)

* support yolov5 6.2 (Tencent#4328)

* implement ncnn fold and unfold (Tencent#4326)

* pnnx load gpu torchscript and reset device (Tencent#4330)

* fix:pnnx-softmax (Tencent#4333)

* pnnx save onnx zero (Tencent#4077)

* save foldable constants in file for reducing memory usage (Tencent#4337)

* match inplace slice copy pattern, rewrite copy uses (Tencent#4338)

* add vector optimization for loongarch64 (Tencent#4242)

* ci loongarch64 lsx (Tencent#4344)

* gridsample op support (Tencent#4288)



Co-authored-by: LRY89757 <LRY89757@users.noreply.github.com>
Co-authored-by: nihuini <nihuini@tencent.com>
Co-authored-by: nihui <shuizhuyuanluo@126.com>

* squeeze and expanddims 4d (Tencent#4346)

* implement MultiheadAttention kdim vdim (Tencent#4347)

* pnnx convert torch bitwise left_shift right_shift (Tencent#4349)

* pnnx fp16 option for ncnn and onnx weight type (Tencent#4350)

* pnnx fuse more function to module (Tencent#4351)

* pnnx fuse more function to module

* rename some pass name

* fuse adjacent reshape, fuse pad conv2d

* fuse pad conv1d

* split tests (Tencent#4354)

* Support mat.numpy() in Python (Tencent#4356)

* Fix typo in stb_image.h (Tencent#4358)

exitting -> exiting

* Fix windows-arm64 build for non-neon case (Tencent#4227)

* update release ci (Tencent#4359)

* update release ci

* find modern glslang

* parallel jobs on windows

* Fix c api allocator (Tencent#4360)

* add some c_api interfaces related to allocator setup.

* fix errors in allocator parameters in c_api.

* test c api allocator

Co-authored-by: zhangtongshe <yuyuyezi@vip.qq.com>

* update glslang (Tencent#4361)

* disable out-of-line atomics since ndk23+ for resolving linking issue with old ndk (Tencent#4362)

* I added one more project to the list of examples. (Tencent#4205)

* Dedicated to coloring black and white photographs.

* add example project link (Tencent#4365)

* fix(pybind11): build error (Tencent#4368)

* fix openmp affinity abort when cpu goes offline (Tencent#4370)

* Update release-python.yml

* small fixes

* unpack list input

* Remove LSTM2

* fix LSTM

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Menci <huanghaorui301@gmail.com>
Co-authored-by: luqiang guo <702572275@qq.com>
Co-authored-by: Lry89757 <77330637+LRY89757@users.noreply.github.com>
Co-authored-by: magicse <magicse@users.noreply.github.com>
Co-authored-by: Zhuo Zhang <imzhuo@foxmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 汤圆奶昔 <47135403+tonori@users.noreply.github.com>
Co-authored-by: Xavier Hsinyuan <me@lstlx.com>
Co-authored-by: thelastlin <thelastlin@users.noreply.github.com>
Co-authored-by: nihui <shuizhuyuanluo@126.com>
Co-authored-by: 柚木鉉 <740291272@qq.com>
Co-authored-by: Zhang Ge <sjtu.zg123@gmail.com>
Co-authored-by: ZhangGe6 <ZhangGe6@users.noreply.github.com>
Co-authored-by: LinHe <LinHe.Lurking@gmail.com>
Co-authored-by: LinHeLurking <LinHeLurking@users.noreply.github.com>
Co-authored-by: nihuini <nihuini@tencent.com>
Co-authored-by: MisakaBit <MisakaBit@gmail.com>
Co-authored-by: LiuYi-Up <73060646+LiuYi-Up@users.noreply.github.com>
Co-authored-by: 陸 言 <robinluaa@outlook.com>
Co-authored-by: miemie2013 <53960695+miemie2013@users.noreply.github.com>
Co-authored-by: Eahow Chen <15228088+veahow@users.noreply.github.com>
Co-authored-by: veahow <veahow@users.noreply.github.com>
Co-authored-by: li mengyang <hwdefcom@outlook.com>
Co-authored-by: Yoh <wpz_yoh@163.com>
Co-authored-by: Caize Wu <zepanwucai@gmail.com>
Co-authored-by: bestpower <wangyu117136@gmail.com>
Co-authored-by: wangyu <786794414@qq.com>
Co-authored-by: shaoshengsong <30892500+shaoshengsong@users.noreply.github.com>
Co-authored-by: WuJinxuan <2456510228@qq.com>
Co-authored-by: junchao-loongson <68935141+junchao-loongson@users.noreply.github.com>
Co-authored-by: LRY89757 <LRY89757@users.noreply.github.com>
Co-authored-by: Ikko Ashimine <eltociear@gmail.com>
Co-authored-by: zhangtongshe <yuyuyezi@vip.qq.com>
Co-authored-by: tpoisonooo <khj.application@aliyun.com>
7 participants